Introduction

In this notebook, we implement a simpler version of the BUC (Bottom-Up Cubing) algorithm.

The original BUC algorithm is a highly optimized and practically efficient algorithm for materializing the entire data cube from a fact table. We instead experiment with a simpler version, called BUC_rec (BUC recursive), based on our teaching slides. The implementation here emphasizes conceptual simplicity rather than efficiency.

Once you understand BUC_rec, you may try to understand BUC if you want to learn more by yourself.

The recursive formula used in BUC_rec is:

$$ Cube(R, \{A,B,C,\ldots,M\}) = \bigcup_{i \in \{1, 2, \ldots, m, \text{ALL}\}} Cube(R_i, \{B,C,\ldots,M\}) $$

where $a_1, a_2, \ldots, a_m$ are the distinct values of dimension $A$ in $R$, $R_i = \pi_{B,C,\ldots,M}(\sigma_{A = a_i}(R))$, and $R_{\text{ALL}} = \pi_{B,C,\ldots,M}(R)$.
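The two relational operations in the formula map directly onto pandas operations. As a small sketch (using the same toy fact table that is loaded below from `./asset/a_.txt`), computing $R_i$ for $a_i = 1$ and $R_{\text{ALL}}$ looks like:

```python
import pandas as pd

# The same toy fact table used later in this notebook:
# dimensions A, B and measure M.
R = pd.DataFrame({'A': [1, 2, 1, 1],
                  'B': [1, 1, 2, 3],
                  'M': [20, 50, 30, 40]})

# R_i = pi_{B,...,M}(sigma_{A = a_i}(R)):
# select the rows where A == a_i, then drop the A column.
a_i = 1
R_i = R[R['A'] == a_i].iloc[:, 1:]

# R_ALL = pi_{B,...,M}(R): simply drop the A column.
R_all = R.iloc[:, 1:]

print(R_i)
print(R_all)
```

Each of these smaller relations is then cubed recursively on the remaining dimensions.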


In [1]:
import sys
import pandas as pd
import numpy as np


ALL = -1

# DEBUG = True
DEBUG = False

##============================================================

# Data file format: 
# * tab-delimited input file
# * 1st line: dimension names and the last dimension is assumed to be the measure
# * rest of the lines: data values. 
def read_data(filename):
    df = pd.read_csv(filename, sep='\t')
    dims = df.shape[1] - 1 # the last dim is the measure
    return (df, dims)

def dump_input2(input):
    if DEBUG: 
        print("\n.. BUC_rec invoked on:")
        print(input)
        print("......................\n")
        
# helper functions
def project_data(input, d):
    # Return only the d-th column of INPUT
    return input.iloc[:, d]

def select_data(input, d, val):
    # SELECT * FROM INPUT WHERE INPUT.d = VAL
    col_name = input.columns[d]
    return input[input[col_name] == val]

def remove_first_dim(input):
    # Remove the first dim of the input
    return input.iloc[:, 1:]

def slice_data_dim0(input, v):
    # syntactic sugar to get R_{ALL} in a less verbose way
    df_temp = select_data(input, 0, v)
    return remove_first_dim(df_temp)

def output(val):
    print('=>\t{}'.format(val))

In [2]:
data, d = read_data('./asset/a_.txt')
print(d)
data


2
Out[2]:
   A  B   M
0  1  1  20
1  2  1  50
2  1  2  30
3  1  3  40

In [3]:
project_data(data, 0)


Out[3]:
0    1
1    2
2    1
3    1
Name: A, dtype: int64

In [4]:
select_data(data, 1, 2)


Out[4]:
   A  B   M
2  1  2  30

In [5]:
slice_data_dim0(data, 2)


Out[5]:
   B   M
1  1  50

Now we can implement the buc_rec() algorithm and test it.


In [6]:
# Note that input is a DataFrame
def buc_rec(input):
    dump_input2(input)
    dims = input.shape[1]
    
    if dims == 1:
        # only the measure dim
        input_sum = sum( project_data(input, 0) )
        output(input_sum)
    else:
        # the general case
        dim0_vals = set(project_data(input, 0).values)

        for dim0_v in dim0_vals:
            sub_data = slice_data_dim0(input, dim0_v)
            buc_rec(sub_data)
        ## for R_{ALL}
        sub_data = remove_first_dim(input)
        buc_rec(sub_data)

In [7]:
buc_rec(data)


=>	20
=>	30
=>	40
=>	90
=>	50
=>	50
=>	70
=>	30
=>	40
=>	140

With the following pivot table, we can easily see the output is correct (i.e., all the (non-empty) aggregates are computed).

But did you notice anything else interesting?


In [8]:
data.pivot_table(index = ['A'], columns = ['B'], aggfunc = np.sum, margins = True)


Out[8]:
        M
B       1     2     3    All
A
1    20.0  30.0  40.0   90.0
2    50.0   NaN   NaN   50.0
All  70.0  30.0  40.0  140.0

Exercise

The above implementation is not complete in that the aggregated values are not associated with their "coordinates". We would like to have something like

1       1       =>      20
1       2       =>      30
1       3       =>      40
1       *       =>      90
2       1       =>      50
2       *       =>      50
*       1       =>      70
*       2       =>      30
*       3       =>      40
*       *       =>      140

Your task is to enhance the implementation of buc_rec so that you can generate the above more readable output.
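As a hint, one possible (self-contained) sketch is to thread the coordinates of the current cell down the recursion as a prefix, appending the dimension value taken at each level and `'*'` for the ALL branch. This is only one way to solve the exercise, not the required solution:

```python
import pandas as pd

def buc_rec_coords(df, prefix=()):
    # Sketch: PREFIX holds the coordinates fixed so far ('*' stands for ALL).
    if df.shape[1] == 1:
        # Only the measure column is left: emit coordinates and aggregate.
        total = df.iloc[:, 0].sum()
        print('\t'.join(map(str, prefix)) + '\t=>\t{}'.format(total))
    else:
        # Recurse on each distinct value of the first dimension
        # (sorted, so the output order is deterministic).
        for v in sorted(set(df.iloc[:, 0].values)):
            sub = df[df.iloc[:, 0] == v].iloc[:, 1:]   # sigma, then pi
            buc_rec_coords(sub, prefix + (v,))
        # The R_{ALL} branch: drop the first dimension entirely.
        buc_rec_coords(df.iloc[:, 1:], prefix + ('*',))

# Same toy data as the notebook's ./asset/a_.txt file.
data = pd.DataFrame({'A': [1, 2, 1, 1],
                     'B': [1, 1, 2, 3],
                     'M': [20, 50, 30, 40]})
buc_rec_coords(data)
```

Run on the toy data, this prints the ten coordinate/aggregate pairs listed above, starting with `1  1  =>  20` and ending with `*  *  =>  140`.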


In [ ]:


In [ ]:


In [ ]:


In [ ]: